Logistic regression on data including categorical variables


library("dplyr")
library("ProjectTemplate")
load.project()

Realistic data analysis!

Let’s consider a logistic regression problem.
We are given a medical dataset from a health insurance company and want to explain the withdrawal of the insured from the insurance using the following variables: gender, region, age, last year’s costs.

First we have to produce a data set where those variables are well defined.

The data set

Reading the preparatory data sets

library("readr")
cache("Vstichtag20090201",{
  
require(tcltk)
msgBox <- tkmessageBox(title = "User action required",
                       message = "Please select the data 'Vstichtag20090201'.", icon = "info", type = "ok")

Vstichtag20090201 <- read.table(file.choose(), header=TRUE, sep="\t")
})
## Loading required package: formatR
##   Skipping cache update for Vstichtag20090201: up to date
classes <- sapply(Vstichtag20090201,class)
head(Vstichtag20090201,3)
##    versnr  txfanr   VEGDAT VEGESC VESPCD   VEEDAT VEECOD VEADAT VEAGR  VERSFA
## 1 1000004 1000004 19281205      M      F 19700701     11      0    NA 1000004
## 2 1000152 1000152 19140830      M      F 19830101     11      0    NA 1000152
## 3 1000276 1000276 19170319      W      D 19721201     11      0    NA 1000276
##   VEKANT VEVERTR VENBSV VEPFNR X_18 X_82 X_46 X_47 X_35 X_20 X_86 X_84 X_85
## 1     GE             NA    122    1    0    0    0    0    0    0    0    0
## 2     VD             NA    110    1    0    0    0    0    0    0    0    0
## 3     BE             NA    157    1    0    0    0    0    0    0    0    0
##   X_83 X_25 X_05 X_03 Online Jahrgang Alter ZielKKnr Stichtag Standdatum A1 AP
## 1    0    0    0    0      0     1928    81       NA 20090201   20090201  1  1
## 2    0    0    0    0      0     1914    95       NA 20090201   20090201  1  1
## 3    0    0    0    0      0     1917    92       NA 20090201   20090201  1  1
##   U Z H3 H2 K H1 F H4 A5 A6 N A4 HT B J X APC S P AE KTI C XL KUTI A3 G T A H
## 1 1 1  0  0 0  0 0  0  0  0 0  0  0 0 0 0   0 0 0  0   0 0  0    0  0 0 0 1 0
## 2 1 1  1  0 0  0 0  0  0  0 0  0  0 0 0 0   0 0 0  0   0 0  0    0  0 0 0 1 1
## 3 1 1  0  1 1  0 0  0  0  0 0  0  0 0 0 0   0 0 0  0   0 0  0    0  0 0 0 1 1
##   ZV nurOKP nurZV beide nurOKP_nurZV_beide okp
## 1  1      0     0     1                  1   1
## 2  1      0     0     1                  1   1
## 3  1      0     0     1                  1   1
#=======================
cache("Vstichtag20100201",{
  
require(tcltk)
msgBox <- tkmessageBox(title = "User action required",
                       message = "Please select the data 'Vstichtag20100201'.", icon = "info", type = "ok")

Vstichtag20100201 <- read.table(file.choose(), header=TRUE, sep="\t")
})
##   Skipping cache update for Vstichtag20100201: up to date
classes <- sapply(Vstichtag20100201,class)
#=======================
cache("Altersgruppen2009",{
  
require(tcltk)
msgBox <- tkmessageBox(title = "User action required",
                       message = "Please select the data 'Altersgruppen2009'.", icon = "info", type = "ok")

Altersgruppen2009 <- read.table(file.choose(), header=TRUE, sep="\t")
})
##   Skipping cache update for Altersgruppen2009: up to date
head(Altersgruppen2009,3)
##   Jahrgang Altersgruppe Alter    Kat AGvon
## 1     2009        00-18     0 Kinder     0
## 2     2008        00-18     1 Kinder     0
## 3     2007        00-18     2 Kinder     0
#=======================
cache("lexj2009",{
  
require(tcltk)
msgBox <- tkmessageBox(title = "User action required",
                       message = "Please select the data 'lexj2009'.", icon = "info", type = "ok")

lexj2009 <- read_csv2(file.choose(), col_names=TRUE, na="NA", col_types=NULL, n_max=Inf)
})
##   Skipping cache update for lexj2009: up to date
head(lexj2009,3)
## # A tibble: 3 × 80
##   ATPGID    KJ    GJ MANDAT GSAG     VERS  DEBINR ADRSQN TUPNR WKT      PLZ
##   <lgl>  <dbl> <dbl> <chr>  <chr>   <dbl>   <dbl>  <dbl> <dbl> <chr>  <dbl>
## 1 NA      2009  2009 0102   0027  3203956 1115469     99   996 GE    121300
## 2 NA      2004  2009 0102   0027  3203956 1115469     99   996 GE    121300
## 3 NA      2009  2009 0102   0027  2650088 1086816     99   996 GE    122200
## # ℹ 69 more variables: CODESEX <chr>, KAT <chr>, ALTGR <chr>, GEBJAHR <dbl>,
## #   EFFALT <chr>, ABT <chr>, DA <dbl>, KLASSE <chr>, KARENZ <chr>, PEGRP <chr>,
## #   REGION <chr>, CODETARIF <chr>, CODESELB <chr>, VERTNR <chr>, VTYP <chr>,
## #   RSTYP <dbl>, RSNR <chr>, PAR <chr>, VZNR <chr>, VATYP <chr>, VADAT <dbl>,
## #   VASTAT <dbl>, STORNO <chr>, STDAT <dbl>, BVONDAT <dbl>, BBISDAT <dbl>,
## #   AZCODE <dbl>, KOA <chr>, SC <chr>, RVKEY <lgl>, ERFNR <chr>, EID <chr>,
## #   DIABEZ <chr>, FALLID <dbl>, POSNR <dbl>, REBLK <chr>, UNFALL <chr>, …
#=======================
cache("KOA",{
  
require(tcltk)
msgBox <- tkmessageBox(title = "User action required",
                       message = "Please select the data 'KOA'.", icon = "info", type = "ok")

KOA <- read_csv2(file.choose(), col_names=TRUE, na="NA", col_types=NULL, n_max=Inf)
})
##   Skipping cache update for KOA: up to date
classes_koa <- sapply(KOA,class)
head(KOA,10)
## # A tibble: 10 × 1
##    `Beschreibung,KOA,OKP,Rechnungsart`                           
##    <chr>                                                         
##  1 "A 101 Akupunktur,101 ,A,"                                    
##  2 "A 107 Arztkosten,107 ,A,"                                    
##  3 "A 110 Medikamente OKP,110 ,A,"                               
##  4 "A 114 Einnahmekontrolle Textanpassung 01.01.2007,114 ,A,"    
##  5 "A 115 Medikamenten-Check Textanpassung 01.01.2007,115 ,A,"   
##  6 "A 116 Bezugs-Check Textanpassung 01.01.2007,116 ,A,"         
##  7 "A 117 Substitution/Generika Textanpassung 01.01.2007,117 ,A,"
##  8 "A 118 Notfalldienst Textanpassung 01.01.2007,118 ,A,"        
##  9 "A 120 Chiropraktoren,120 ,A,chiro"                           
## 10 "A 121 Logop\xe4die,121 ,A,logo"

The withdrawals

A row of the data set Vstichtag corresponds to a single insured person. We determine the withdrawals:

names <- names(Vstichtag20090201)
leftJoin <- merge(x=Vstichtag20090201, y=Vstichtag20100201[,c("versnr","VEGESC")], by="versnr", all.x=TRUE)

We move the relevant variables (columns) to the far left of the data set:

leftJoin <- select(leftJoin, VEGESC.x, VEGESC.y, everything())

We then define a new variable that identifies an insured person as a withdrawal:

leftJoin <- mutate(leftJoin,aus=ifelse(is.na(VEGESC.y),1,0))

We extract the withdrawals:

Abgaenge <- filter(leftJoin,aus==1)
head(Abgaenge,3)
##   VEGESC.x VEGESC.y  versnr  txfanr   VEGDAT VESPCD   VEEDAT VEECOD VEADAT
## 1        M     <NA> 1000152 1000152 19140830      F 19830101     11      0
## 2        W     <NA> 1000756 1000756 19310825      D 19560601     11      0
## 3        W     <NA> 1000829 1000829 19321120      D 19470101     11      0
##   VEAGR  VERSFA VEKANT VEVERTR VENBSV VEPFNR X_18 X_82 X_46 X_47 X_35 X_20 X_86
## 1    NA 1000152     VD             NA    110    1    0    0    0    0    0    0
## 2    NA 1000756     BE             NA    183    0    0    0    0    0    0    0
## 3    NA 1000829     BE             NA    160    1    0    0    0    0    0    0
##   X_84 X_85 X_83 X_25 X_05 X_03 Online Jahrgang Alter ZielKKnr Stichtag
## 1    0    0    0    0    0    0      0     1914    95       NA 20090201
## 2    0    0    0    0    0    0      0     1931    78       NA 20090201
## 3    0    0    0    0    0    0      0     1932    77       NA 20090201
##   Standdatum A1 AP U Z H3 H2 K H1 F H4 A5 A6 N A4 HT B J X APC S P AE KTI C XL
## 1   20090201  1  1 1 1  1  0 0  0 0  0  0  0 0  0  0 0 0 0   0 0 0  0   0 0  0
## 2   20090201  1  1 1 1  0  0 1  1 0  0  0  0 0  0  0 0 0 0   0 0 0  0   0 0  0
## 3   20090201  0  1 1 0  0  1 0  0 0  0  0  0 0  1  0 0 0 0   0 0 0  0   0 0  0
##   KUTI A3 G T A H ZV nurOKP nurZV beide nurOKP_nurZV_beide okp aus
## 1    0  0 0 0 1 1  1      0     0     1                  1   1   1
## 2    0  0 0 0 1 1  1      0     0     1                  1   1   1
## 3    0  0 0 0 1 1  1      0     0     1                  1   1   1

It should be noted that the death of an insured person is considered a withdrawal. The way to remove these “withdrawals” from our considerations would be to use the information on the reason for the withdrawal, which is supposed to be contained in the variable VEAGR (AustrittsGRund). The problem is that no description of the possible values taken by VEAGR is available. For the sake of simplicity, a withdrawal is considered to be an insured person who had at least one insurance coverage (see the binary variables A (=basic insurance), H1 (=semi-private hospital), H2, H3 (=private hospital), Z (=dental coverage) and so on) with the insurance company in the first year (2009), but no coverage in the following year.
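
As a quick plausibility check, we can look at the share of withdrawals among all insured persons; a minimal sketch (output not shown):

table(leftJoin$aus)   # counts: stayers (0) versus withdrawals (1)
mean(leftJoin$aus)    # the withdrawal rate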

gender, region, age, last year’s costs

Abgaenge <- select(Abgaenge,versnr,VEGESC.x,VEKANT,Alter,aus)
# an observation ~ a person

Abgaenge2 <- merge(x=Abgaenge, y=lexj2009[,c("VERS","BETRBRUT","KOA")],by.x="versnr",by.y="VERS",all.x=TRUE)
head(Abgaenge2,3)
##    versnr VEGESC.x VEKANT Alter aus BETRBRUT KOA
## 1 1000152        M     VD    95   1     4.30 115
## 2 1000152        M     VD    95   1    27.00 130
## 3 1000152        M     VD    95   1     3.25 116
# an observation ~ a person's cost

We create a bar plot in order to visualize the costs (BETRBRUT) per cost category (KOA), and then select the main categories:

unique(Abgaenge2[,"KOA"])
##   [1] "115" "130" "116" "107" "110" "111" "117" "207" "230" "950" "210" "118"
##  [13] "429" "570" "430" "232" "413" "211" "412" "132" "406" "301" "421" "300"
##  [25] "434" "432" "442" "407" "441" "426" "381" "478" "122" "335" "440" NA   
##  [37] "128" "473" "140" "336" "246" "957" "856" "245" "113" "958" "471" "133"
##  [49] "414" "160" "420" "120" "460" "621" "475" "127" "849" "145" "415" "146"
##  [61] "150" "231" "840" "842" "427" "436" "853" "612" "610" "619" "634" "571"
##  [73] "625" "620" "623" "618" "626" "615" "951" "121" "551" "428" "141" "561"
##  [85] "843" "443" "461" "450" "847" "124" "340" "391" "126" "125" "002" "260"
##  [97] "437" "262" "101" "435" "263" "998" "472" "341" "447" "575" "445" "850"
## [109] "495" "496" "213" "952" "433" "317" "114" "390" "311" "453" "312" "221"
## [121] "846" "261" "479" "361" "371" "200" "474" "170" "633" "857" "171" "860"
## [133] "365" "658" "201" "624" "854" "550" "313" "500" "691" "851" "572" "351"
## [145] "152" "151" "702" "485" "438" "858" "484" "477" "617" "628" "493" "611"
## [157] "616" "494" "690"
Abgaenge3 <- group_by(Abgaenge2,KOA)
# an observation ~ a cost category

# Some variables have the wrong format: BETRBRUT is not numeric, which breaks the summation below. We correct this:
Abgaenge3$BETRBRUT <- as.numeric(Abgaenge3$BETRBRUT)
Abgaenge4 <- summarise(Abgaenge3,count=n(),sum=sum(BETRBRUT))
Abgaenge4 <- arrange(Abgaenge4,desc(sum))
# an observation ~ a cost category

Abgaenge4 <- Abgaenge4[1:10,]
head(Abgaenge4,3)
## # A tibble: 3 × 3
##   KOA   count       sum
##   <chr> <int>     <dbl>
## 1 300    4348 22008061.
## 2 301    1959 11562999.
## 3 140    4912  9299465.
barplot(Abgaenge4$sum, names.arg = Abgaenge4$KOA, ylab = "Cost (BETRBRUT)", xlab = "category of cost (KOA)", col = "blue", main = "Cost per category of cost")

We decide to consider solely the costs corresponding to the categories KOA = 300, 301, 140 and 110.

Abgaenge2$BETRBRUT <- as.numeric(Abgaenge2$BETRBRUT)
Abgaenge5 <- filter(Abgaenge2,KOA %in% c("300","301","140","110"))  # KOA is a character column
# an observation ~ a person's cost

Abgaenge6 <- group_by(Abgaenge5,versnr)
Abgaenge6 <- summarise(Abgaenge6,BETRBRUT=sum(BETRBRUT))
# an observation ~ a person

Abgaenge <- merge(x=Abgaenge,y=Abgaenge6,by="versnr", all.x=TRUE)
# an observation ~ a person
# warning: new variable created (BETRBRUT) with N/A values

print(names(Abgaenge))
## [1] "versnr"   "VEGESC.x" "VEKANT"   "Alter"    "aus"      "BETRBRUT"
names(Abgaenge) <- c("persId","gender","canton","age","out","cost")
head(Abgaenge,3)
##    persId gender canton age out    cost
## 1 1000152      M     VD  95   1  2165.7
## 2 1000756      W     BE  78   1  1063.5
## 3 1000829      W     BE  77   1 52524.4

To perform a regression, we need to add to the Abgaenge data set observations on insured persons who did not leave the health insurance company (the remainder). We produce the needed data set and “row bind” it to Abgaenge. We obtain the data set insured2009.

leftJoin2 <- select(leftJoin,versnr,VEGESC.x,VEKANT,Alter)
# an observation ~ a person

leftJoin3 <- merge(x=leftJoin2, y=lexj2009[,c("VERS","BETRBRUT","KOA")],by.x="versnr",by.y="VERS",all.x=TRUE)
# an observation ~ a person's cost

leftJoin3$BETRBRUT <- as.numeric(leftJoin3$BETRBRUT)
leftJoin4 <- filter(leftJoin3,KOA %in% c("300","301","140","110"))  # KOA is a character column
leftJoin5 <- group_by(leftJoin4,versnr)
leftJoin5 <- summarise(leftJoin5,BETRBRUT=sum(BETRBRUT))
# an observation ~ a person

leftJoin6 <- merge(x=leftJoin2,y=leftJoin5,by="versnr",all.x=TRUE)
# an observation ~ a person
# warning: new variable created (BETRBRUT) with N/A values

print(names(leftJoin6))
## [1] "versnr"   "VEGESC.x" "VEKANT"   "Alter"    "BETRBRUT"
names(leftJoin6) <- c("persId","gender","canton","age","cost")
leftJoin6 <- mutate(leftJoin6,out=0)

remainder <- anti_join(leftJoin6,Abgaenge,by="persId")
# an observation ~ a person
# warning: variable (cost) with N/A values

insured2009 <- rbind(Abgaenge,remainder)
insured2009 <- mutate(insured2009,cost=ifelse(is.na(cost),0,cost))
# an observation ~ a person
# no variable with N/A values

Quality control:

  - the number of observations in insured2009 must be equal to that in Vstichtag20090201;
  - the produced tables must be free of duplicate policyholders:

# the number of observations must match
stopifnot(nrow(insured2009) == nrow(Vstichtag20090201))

# duplicate policyholders (all three results must have zero rows)
doublons1 <- Vstichtag20090201[duplicated(Vstichtag20090201[,"versnr"]),]
doublons2 <- Abgaenge[duplicated(Abgaenge[,"persId"]),]
doublons3 <- insured2009[duplicated(insured2009[,"persId"]),]
stopifnot(nrow(doublons1) == 0, nrow(doublons2) == 0, nrow(doublons3) == 0)

The categorical variables gender and canton have to be formatted as factors in order to perform a regression:

insured2009$gender <- as.factor(insured2009$gender)
insured2009$canton <- as.factor(insured2009$canton)
head(insured2009,10)
##     persId gender canton age out     cost
## 1  1000152      M     VD  95   1  2165.70
## 2  1000756      W     BE  78   1  1063.50
## 3  1000829      W     BE  77   1 52524.40
## 4  1000977      M     TI  73   1 16185.45
## 5  1001191      W     BE  68   1 11163.65
## 6  1001345      W     BE  66   1     0.00
## 7  1003062      W     GE  49   1     0.00
## 8  1003070      W     AG  49   1     2.55
## 9  1003682      W     VS  45   1     0.00
## 10 1003720      M     VS  44   1     0.00

That’s it! We have obtained the basic data table needed for further statistical analysis: the insured2009 table with all the variables specified at the beginning.

The statistical model

We want to explain the withdrawal of the insured from the insurance using the following variables: gender, region, age, last year’s costs.

We are given the following explanatory variables:

\(GENDER\): a binary categorical variable with values \(M\) for “male” and \(W\) for “female”

\(AGE\): a metric variable with values in \(\{0,1,2,\ldots,120\}\)

\(COST\): a metric variable with values in \(\mathbb{R}_+\)

\(CANTON\): a non-binary categorical variable

These explanatory variables are used to explain whether or not insured persons leave the health insurance company, i.e. to explain the following response variable:

\(OUT \sim \mathcal{B}ern(\pi)\), with values 0 for “no withdrawal” and 1 for “withdrawal”

where the parameter \(\pi\) depends on the explanatory variables according to the statistical model presented below:

\[ \pi = p(OUT=1 \,|\, X) = \frac{1}{1+\exp(-(X\cdot\beta^T))} \]

where \(X\) (called the design matrix) is a \(n\times (p+1)\)-matrix, \(n\) being the number of observations and \(p\) the number of explanatory variables. The first column of \(X\) contains solely 1’s; each other column corresponds to one of the \(p\) explanatory variables. The linear predictor \(\eta :=X\cdot\beta^T\) then becomes:

\[ \eta= \begin{pmatrix} \eta_1 \\ \eta_2 \\ \eta_3 \\ \eta_4 \\ \eta_5 \\ \eta_6 \\ \vdots \end{pmatrix} := \begin{pmatrix} 1 & W & c_2 & 67 & 1493\\ 1 & W & c_3 & 73 & 1577\\ 1 & M & c_1 & 26 & 0\\ 1 & M & c_3 & 12 & 142\\ 1 & W & c_1 & 27 & 840\\ 1 & M & c_2 & 54 & 2314\\ \vdots & \vdots & \vdots & \vdots & \vdots\\ \end{pmatrix} \cdot \begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \\ \beta_4 \end{pmatrix} \]

There’s just a problem here: the second and third columns are not numerical! The solution is to recode them, as explained with an example in the following section.

Full disjunctive coding for categorical variables

The numerical columns of the design matrix \(X\) don’t need to be changed, whereas the categorical columns all have to be recoded in the following manner.

Consider a column \(v\) of \(X\) admitting three categories \(c_1,c_2,c_3\). We proceed to the following “dummy”-recoding:

\[ \begin{pmatrix} c_2\\c_1\\c_1\\c_3\\c_1\\c_2\\ \vdots \end{pmatrix} \longrightarrow \begin{pmatrix} 0 & 1 & 0\\ 1 & 0 & 0\\ 1 & 0 & 0\\ 0 & 0 & 1\\ 1 & 0 & 0\\ 0 & 1 & 0\\ \vdots & \vdots & \vdots \end{pmatrix} \]
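
In R, this full disjunctive coding can be obtained with model.matrix; a minimal sketch on a toy factor with the three levels above:

v <- factor(c("c2","c1","c1","c3","c1","c2"))
B <- model.matrix(~ v - 1)   # one dummy column per level, no intercept
B                            # reproduces the 0/1 block above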

Motivation for such a recoding: the idea is to carry out as many separate regressions as there are categories (we speak of levels). For the \(c_1\) level this gives the simple regression

\[ \eta_{i,c_1} = \beta_{0,c_1}+\beta_{1,c_1}\cdot x_{i,c_1}+\varepsilon_{i,c_1}\,, \] and it is the same for the other categories \(c_2\) and \(c_3\). By using an index \(j\) running through the categories: \(j\in\{c_1,c_2,c_3\}\) we come to consider:

\[ \eta_{i,j} = \beta_{0,j}+\beta_{1,j}\cdot x_{i,j}+\varepsilon_{i,j}\quad, j\in\{c_1,c_2,c_3\} \]

Incorporating the categorical variables into the design matrix

The proposed recoding replaces the column \(v\) of \(X\) with as many columns \(v_i\) as there are levels of \(v\). The problem is then that the first column \(w\) of \(X\), the one filled with 1’s, is a linear combination of the \(v_i\), because each row of the dummy block sums to 1:

\[ w=v_1+v_2+v_3 \]

A design matrix must not contain linearly dependent columns. One solution is to delete one of the \(k=3\) columns \(v_i\), say the last column \(v_k\).

Let \(B_v\) denote the \(n\times k\)-matrix \([v_1|v_2|\cdots|v_k]\). A more general solution consists in multiplying \(B_v\) with a so-called contrast matrix \(C_v\) of dimensions \(k\times (k-1)\). The resulting matrix \(B_v\cdot C_v=:\tilde{B}_v\) is then a matrix of dimensions \(n\times(k-1)\):

\[\begin{eqnarray*} \begin{pmatrix} c_2\\c_1\\c_1\\c_3\\c_1\\c_2\\ \vdots \end{pmatrix} &\longrightarrow& \begin{pmatrix} 0 & 1 & 0\\ 1 & 0 & 0\\ 1 & 0 & 0\\ 0 & 0 & 1\\ 1 & 0 & 0\\ 0 & 1 & 0\\ \vdots & \vdots & \vdots \end{pmatrix} \\ &\longrightarrow& \begin{pmatrix} 0 & 1 & 0\\ 1 & 0 & 0\\ 1 & 0 & 0\\ 0 & 0 & 1\\ 1 & 0 & 0\\ 0 & 1 & 0\\ \vdots & \vdots & \vdots \end{pmatrix} \cdot \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 0 \\ \end{pmatrix} = \begin{pmatrix} 0 & 1 \\ 1 & 0 \\ 1 & 0 \\ 0 & 0 \\ 1 & 0 \\ 0 & 1 \\ \vdots & \vdots \end{pmatrix} =\tilde{B}_v \end{eqnarray*}\]
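
The same in R, continuing the toy sketch above: contr.treatment() with the last level as baseline yields exactly the contrast matrix \(C_v\) used here.

C <- contr.treatment(3, base = 3)   # k x (k-1) contrast matrix, baseline = last level
B %*% C                             # tilde-B_v, an n x (k-1) matrix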

We then do the same for each categorical column in our \(X\) defined above, which gives us:

\[\begin{eqnarray*} X &=& \begin{pmatrix} 1 & W & ZH & 67 & 1493\\ 1 & W & GE & 73 & 1577\\ 1 & M & BE & 26 & 0\\ 1 & M & ZH & 12 & 142\\ 1 & W & BE & 27 & 840\\ 1 & M & BE & 54 & 2314\\ \vdots & \vdots & \vdots & \vdots & \vdots\\ \end{pmatrix} \\ &=& \left( \mathbf{1} \;|\: GENDER \;|\: CANTON \;|\: AGE \;|\: COST \right) \\ &\longrightarrow& \left( \mathbf{1} \;|\: B_\mbox{GENDER}\cdot C_\mbox{GENDER} \;|\: B_\mbox{CANTON}\cdot C_\mbox{CANTON} \;|\: AGE \;|\: COST \right) \\ &=& \begin{pmatrix} \begin{array}{c|c|cc|cc} 1 & 0 & 0 & 0 & 67 & 1493\\ 1 & 0 & 0 & 1 & 73 & 1577\\ 1 & 1 & 1 & 0 & 26 & 0\\ 1 & 1 & 0 & 0 & 12 & 142\\ 1 & 0 & 1 & 0 & 27 & 840\\ 1 & 1 & 1 & 0 & 54 & 2314\\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots\\ \end{array} \end{pmatrix} \end{eqnarray*}\]

where, to simplify the presentation, we consider that the variable \(CANTON\) has only 3 levels instead of 26.

The contrast matrix R uses for the variable \(CANTON\) (with not 3 but all 30 levels present in the data; note that R’s default treatment contrasts drop the first level, here AG):

options(width=10000) # to avoid the line-breaking of a table with many columns when printed in the RMarkdown output document
contrasts(insured2009$canton)
##    AI AR AU BE BL BS DE FL FR GE GL GR IT JU LU NE NW OW SG SH SO SZ TG TI UR VD VS ZG ZH
## AG  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
## AI  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
## AR  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
## AU  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
## BE  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
## BL  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
## BS  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
## DE  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
## FL  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
## FR  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
## GE  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
## GL  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
## GR  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
## IT  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
## JU  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
## LU  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0
## NE  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0
## NW  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0
## OW  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0
## SG  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0
## SH  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0
## SO  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0
## SZ  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0
## TG  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0
## TI  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0
## UR  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0
## VD  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0
## VS  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0
## ZG  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0
## ZH  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1

With our adapted design matrix, which we will also call \(X\), we can finally write the linear predictor \(\eta=X\cdot\beta^T\):

\[ \eta= \begin{pmatrix} \eta_1 \\ \eta_2 \\ \eta_3 \\ \eta_4 \\ \eta_5 \\ \eta_6 \\ \vdots \end{pmatrix} := \begin{pmatrix} \begin{array}{c|c|cc|cc} 1 & 0 & 0 & 0 & 67 & 1493\\ 1 & 0 & 0 & 1 & 73 & 1577\\ 1 & 1 & 1 & 0 & 26 & 0\\ 1 & 1 & 0 & 0 & 12 & 142\\ 1 & 0 & 1 & 0 & 27 & 840\\ 1 & 1 & 1 & 0 & 54 & 2314\\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots\\ \end{array} \end{pmatrix} \cdot \begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \\ \beta_4 \\ \beta_5 \end{pmatrix} \]

Model fitting

Keep in mind that the linear predictor \(\eta=X\cdot\beta^T\) determines the mean parameter \(\pi\) through the logistic function; the realization vector of \(\eta\) is the \(N\times 1\)-vector \((\eta_1,\ldots,\eta_N)^T\).

The likelihood is:

\[ \begin{align} \large L = \displaystyle\prod_{i=1}^N \pi_i^{OUT_i}(1-\pi_i)^{1-OUT_i} \end{align} \]

where \(\pi_i=\frac{1}{1+\exp(-(x_i\cdot\beta^T))}\), \(x_i\) being the \(i\)-th row of the design matrix \(X\).

The negative log-likelihood for logistic regression is then given by

\[ NLL(\beta) = -\sum_{i=1}^N \left( OUT_i\cdot\log(\pi_i) + (1-OUT_i)\cdot\log(1-\pi_i) \right) \]
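
This translates directly into R; a sketch, where X is the design matrix, y the 0/1 response and beta the coefficient vector:

nll <- function(beta, X, y) {
  pi <- 1/(1 + exp(-(X %*% beta)))            # the vector of the pi_i
  -sum(y * log(pi) + (1 - y) * log(1 - pi))   # negative log-likelihood
}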

Unlike linear regression, we cannot write down the MLE in closed form. Instead, we need to use an optimization algorithm to compute it. For this purpose, we need to derive the gradient and Hessian matrix.

\[ \nabla NLL(\beta) = \begin{pmatrix} \partial_{\beta_0} NLL \\ \partial_{\beta_1} NLL \\ \vdots \\ \partial_{\beta_p} NLL \end{pmatrix} \]

with one entry per coefficient (the intercept, the \(GENDER\) and \(CANTON\) dummies, \(AGE\) and \(COST\)).

Using the chain rule and recalling that \(\pi=\frac{1}{1+\exp(-\eta)}\) and that \(\eta = X\cdot\beta^T\): \[ \begin{align} \frac{\partial NLL}{\partial \beta_i} = \displaystyle \sum_{n=1}^N \frac{\partial NLL}{\partial \pi_n}\frac{\partial \pi_n}{\partial \eta_n}\frac{\partial \eta_n}{\partial \beta_i} \end{align} \]

Thus, we are looking to obtain three different derivatives.

First we calculate

\[ \begin{align} \frac{\partial NLL}{\partial \pi_n} = -\left( OUT_n \frac{1}{\pi_n} + (1-OUT_n) \frac{1}{1-\pi_n}(-1) \right) = \frac{1-OUT_n}{1-\pi_n} - \frac{OUT_n}{\pi_n} \end{align} \]

Next, we solve for the derivative of \(\pi\) with respect to the linear predictor function \(\eta\), this for each observation \(n\):

\[ \begin{align} \large \pi_n(\eta_n) = \frac{1}{1+e^{-\eta_n}} \end{align} \]

\[ \begin{align} \frac{\partial \pi_n}{\partial \eta_n} = \frac{e^{-\eta_n}}{(1+e^{-\eta_n})^2} = \frac{1}{1+e^{-\eta_n}} \frac{e^{-\eta_n}}{1+e^{-\eta_n}} = \pi_n-\pi_n^2 = \pi_n(1-\pi_n) \end{align} \]

because of the algebraic relationship

\[ \frac{1}{1+z}-\left(\frac{1}{1+z}\right)^2=\frac{z}{(1+z)^2} \]

Lastly, we solve for the derivative of the linear predictor function with respect to the weights:

\[ \begin{eqnarray*} \eta_n &=& x_n\cdot\beta^T \\ &=& \beta_0x_{n0}+\beta_1x_{n1}+\beta_2x_{n2}+\cdots+\beta_px_{np} \end{eqnarray*} \]

\[ \frac{\partial \eta_n}{\partial \beta_i} = x_{ni} \]

We now put it all together:

\[ \begin{eqnarray*} \frac{\partial \,NLL}{\partial \beta_i} &=& \sum_{n=1}^N\left(\frac{1-OUT_n}{1-\pi_n}-\frac{OUT_n}{\pi_n}\right)\pi_n(1-\pi_n)\,x_{ni} \\ &=& \sum_{n=1}^N \left( (1-OUT_n)\,\pi_n - OUT_n(1-\pi_n) \right) x_{ni} \\ &=& \sum_{n=1}^N \left( \pi_n-OUT_n\,\pi_n-OUT_n+OUT_n\,\pi_n \right) x_{ni} \\ &=& \sum_{n=1}^N(\pi_n-OUT_n)\,x_{ni} \end{eqnarray*} \]

Collecting these components, the gradient with respect to \(\beta\) is the column vector

\[ \mbox{grad } NLL = \frac{\partial \,NLL}{\partial \beta} = \sum_{n=1}^{N}(\pi_n-OUT_n)\,x_n = X^T\cdot(\pi-OUT). \]
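
The gradient formula also translates directly into R (same conventions as the nll() sketch above):

grad_nll <- function(beta, X, y) {
  pi <- 1/(1 + exp(-(X %*% beta)))
  t(X) %*% (pi - y)                 # X^T (pi - OUT), a column vector
}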

Analogously we obtain the Hessian matrix:

\[ \begin{eqnarray*} H &=& \frac{\partial \,(\mbox{grad } NLL)^T}{\partial \beta} = \sum_{n=1}^N (\nabla_\beta \pi_n)\cdot x_n^T = \sum_{n=1}^N \pi_n(1-\pi_n)\,x_n\,x_n^T \\ &=& X^T\cdot S\cdot X \end{eqnarray*} \]

where \(S:=\mbox{diag}(\pi_n\cdot(1-\pi_n))\). One has to verify whether \(H\) is positive definite, which is the case whenever \(X\) has full column rank. Then the \(NLL\) is convex and has a unique global minimum.
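
Likewise for the Hessian, whose positive definiteness can be checked through its eigenvalues; a sketch:

hess_nll <- function(beta, X) {
  pi <- as.vector(1/(1 + exp(-(X %*% beta))))
  t(X) %*% (X * (pi * (1 - pi)))    # X^T S X, scaling row n of X by pi_n(1-pi_n)
}
# all(eigen(hess_nll(beta, X), symmetric = TRUE)$values > 0) tests positive definiteness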

The gradient descent algorithm

Consider a given start value \(\theta_0\) for the parameter vector \(\theta\) of a real-valued function \(f(\theta)\) whose global minimum we seek. The method consists in modifying the parameter vector stepwise: \[ \theta_{k+1}=\theta_k-\lambda_k\cdot\mbox{grad }f(\theta_k) \]

where \(\lambda_k\) is called, depending on the context, the step size or the learning rate. The gradient points in the direction of steepest ascent; since we want to minimize the \(NLL\), we step in the direction of \(-\mbox{grad }f(\theta_k)\). To ensure that a stationary point found this way is the global minimum, one has to check the convexity of \(f\) through its Hessian.
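
As a toy illustration of the update rule, minimize \(f(\theta)=(\theta-3)^2\), whose gradient is \(2(\theta-3)\):

theta <- 0         # start value theta_0
lambda <- 0.1      # constant step size
for(k in 1:100) theta <- theta - lambda * 2 * (theta - 3)
theta              # converges towards the minimizer 3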

Motivation of the method

By Taylor’s theorem, we have \[ f(\theta_k+\lambda_k\cdot d_k) \approx f(\theta_k)+\lambda_k\cdot(\mbox{grad }f(\theta_k+\lambda_k\cdot d_k))\cdot d_k\,, \]

where \(d_k\) is the descent direction at step \(k\) and \(\mbox{grad }f(\theta_k+\lambda_k\cdot d_k)\) is the gradient at the end of the step. So if \(d_k\) is a descent direction and the step size \(\lambda_k\) is chosen small enough, then \(f(\theta_k+\lambda_k\cdot d_k) < f(\theta_k)\). Let us determine the step size \(\lambda_k\) that minimizes

\[ \phi(\lambda_k):=f(\theta_k+\lambda_k\cdot d_k) \]

A necessary condition for the minimum is \(\phi'(\lambda_k)=0\). By the chain rule,

\[ \phi'(\lambda_k) = (d_k)^T\cdot\mbox{grad }f(\theta_k+\lambda_k\cdot d_k) = 0 \]

So we either have \(\mbox{grad }f(\theta_k+\lambda_k\cdot d_k)=0\), which means we have found a stationary point, or

\[ \mbox{grad }f(\theta_k+\lambda_k\cdot d_k) \perp d_k\,, \]

which means that exact line search stops at a point where the local gradient (i.e. the gradient at the end of the step) is perpendicular to the search direction. Hence consecutive directions will be orthogonal. That fact motivates the choice of the descent direction \(d_k=-\mbox{grad }f(\theta_k)\), because when we choose any other \(d_k\), at the latest after one more step, the new direction will again be given by the local gradient.

Implementation

The design matrix has to be dummy-recoded.

head(insured2009,10)
##     persId gender canton age out     cost
## 1  1000152      M     VD  95   1  2165.70
## 2  1000756      W     BE  78   1  1063.50
## 3  1000829      W     BE  77   1 52524.40
## 4  1000977      M     TI  73   1 16185.45
## 5  1001191      W     BE  68   1 11163.65
## 6  1001345      W     BE  66   1     0.00
## 7  1003062      W     GE  49   1     0.00
## 8  1003070      W     AG  49   1     2.55
## 9  1003682      W     VS  45   1     0.00
## 10 1003720      M     VS  44   1     0.00
options(width=10000) # to avoid the line-breaking of a table with many columns when printed in the RMarkdown output document
library(fastDummies)
X <- dummy_cols(insured2009, select_columns = c("gender","canton"),
           remove_selected_columns = TRUE, remove_first_dummy  = TRUE)
X <- as.matrix(X)
X <- X[,-c(1,3)]   # drop the persId and out columns; out serves separately as the response
head(X,10)
##       age     cost gender_W canton_AI canton_AR canton_AU canton_BE canton_BL canton_BS canton_DE canton_FL canton_FR canton_GE canton_GL canton_GR canton_IT canton_JU canton_LU canton_NE canton_NW canton_OW canton_SG canton_SH canton_SO canton_SZ canton_TG canton_TI canton_UR canton_VD canton_VS canton_ZG canton_ZH
##  [1,]  95  2165.70        0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         1         0         0         0
##  [2,]  78  1063.50        1         0         0         0         1         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0
##  [3,]  77 52524.40        1         0         0         0         1         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0
##  [4,]  73 16185.45        0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         1         0         0         0         0         0
##  [5,]  68 11163.65        1         0         0         0         1         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0
##  [6,]  66     0.00        1         0         0         0         1         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0
##  [7,]  49     0.00        1         0         0         0         0         0         0         0         0         0         1         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0
##  [8,]  49     2.55        1         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0
##  [9,]  45     0.00        1         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         1         0         0
## [10,]  44     0.00        0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         0         1         0         0
# Calculate the rank of the matrix X
rank_X <- qr(X)$rank
if (rank_X == ncol(X)) {
  print("The columns of X are linearly independent.")
} else {
  print("The columns of X are not linearly independent.")
}
## [1] "The columns of X are linearly independent."
cat("dim(X)=",dim(X),"\n")
## dim(X)= 328253 32

The full model has 4 parameters \(\beta_1,\ldots,\beta_4\) plus the intercept \(\beta_0\), so that the dimension of the parameter vector \(\theta:=\beta\) is 5. After dummy-recoding, the model has 32 parameters (note that the implementation above drops the intercept column).

dim(X)
## [1] 328253     32

We recall that

\[ \theta_{k+1}=\theta_k-\lambda_k\cdot\mbox{grad }f(\theta_k) \] and that

\[ \mbox{grad } NLL = \frac{\partial \,NLL}{\partial \beta}=X^T\cdot(\pi-OUT). \] We choose a start value for the 32-dimensional \(\theta_0\) and a step size (assumed constant for the sake of simplicity).

theta_0 <- vector("numeric",length=32)    # start value
lambda <- 0.05                            # step size

out <- insured2009$out   # response vector ('out' is no longer a column of X)
theta <- t(theta_0)
erg <- theta
for(j in 1:1000){
  eta <- X %*% t(theta)
  pi <- 1/(1+exp(-eta))
  grad <-  t(pi-out) %*% X   # the transposed gradient, a row vector like theta
  theta <- theta - lambda * grad
  if(j %% 50 == 0){
    erg <- rbind(erg,theta)
  }
}
print(erg)
##              age      cost  gender_W canton_AI canton_AR canton_AU  canton_BE canton_BL canton_BS canton_DE canton_FL canton_FR canton_GE canton_GL canton_GR canton_IT canton_JU canton_LU canton_NE canton_NW canton_OW canton_SG canton_SH   canton_SO canton_SZ canton_TG canton_TI canton_UR canton_VD canton_VS canton_ZG canton_ZH
##  [1,]       0.00         0       0.0      0.00      0.00      0.00      0.000      0.00      0.00      0.00     0.000     0.000     0.000      0.00     0.000     0.000     0.000     0.000     0.000      0.00     0.000      0.00     0.000     0.00000      0.00      0.00      0.00     0.000      0.00     0.000      0.00      0.00
##  [2,]   42621.80   5108308  202439.3    -41.85     32.40    155.30  -1375.025   -170.45    252.45      0.10   -10.975  -134.425   591.125     52.15   -61.975    -1.375   -22.725  -266.025   752.325   -287.90   -45.975   -486.60    20.125    77.75000   -112.15    -52.20    529.25  -167.175   2445.00   108.975   -152.95   -455.20
##  [3,] -289061.75  -2743883  399641.5    -94.05     16.70    284.45  -5263.375   -575.70    417.45      0.10   -23.925  -676.725   755.925     87.90  -384.575    -2.775   -83.775  -822.875  1380.675   -656.45  -108.275  -1486.70    -2.325   -88.10000   -293.10   -239.40    210.85  -380.475   4005.30  -103.125   -376.60  -2093.55
##  [4,] -338099.10  -7171532  600846.1   -142.95      9.60    395.90  -7679.825   -814.40    601.05      0.15   -35.675  -922.575  1243.025    129.40  -534.125    -4.175  -134.875 -1193.075  1928.125   -985.75  -162.125  -2148.85    -6.075  -124.55000   -430.70   -384.20    607.55  -555.925   5992.90   -58.425   -576.05  -3049.95
##  [5,]  133856.35  10989246  807095.4   -179.25     84.30    572.85  -6670.375   -752.05    921.30      0.40   -44.425  -667.275  2241.525    198.10  -328.075    -5.525  -120.575 -1170.275  2769.675  -1203.15  -195.125  -2129.00    56.325   173.45000   -472.20   -324.10   2009.60  -671.225   9219.70   370.275   -675.90  -2371.45
##  [6,]   90858.70   3137222 1004669.1   -224.30    116.40    737.80  -8589.575   -981.35   1154.95      0.55   -55.475  -920.575  2665.725    247.80  -443.775    -6.925  -142.625 -1489.775  3567.925  -1502.25  -244.875  -2702.20    75.375   197.70000   -590.55   -387.45   2250.70  -845.175  11453.50   386.575   -845.80  -3037.15
##  [7,] -267788.35  -4725050 1200501.0   -278.70     94.10    862.05 -12969.025  -1426.85   1296.95      0.55   -68.475 -1544.975  2780.175    281.80  -802.775    -8.325  -206.125 -2102.725  4188.825  -1892.45  -312.525  -3802.05    48.675   -18.05000   -781.80   -605.95   1806.75 -1067.175  12876.35   127.075  -1101.45  -4920.20
##  [8,]   92320.39  13387265 1403315.8   -319.00    150.80   1020.50 -12982.725  -1444.45   1577.05      0.75   -77.925 -1445.575  3646.125    344.00  -682.625    -9.675  -207.575 -2190.225  4952.775  -2142.50  -354.025  -3971.75    94.825   178.02013   -847.20   -604.95   2929.35 -1198.075  15745.95   447.475  -1243.40  -4739.30
##  [9,]   53363.04   5535892 1599606.6   -365.60    181.05   1185.75 -15245.725  -1696.80   1799.70      0.90   -88.775 -1772.625  4028.975    393.50  -816.575   -11.075  -234.175 -2540.925  5742.125  -2455.55  -409.225  -4609.55   110.625   165.12013   -972.20   -692.00   3084.70 -1375.825  17842.70   429.025  -1439.10  -5592.25
## [10,] -187206.36  -2324116 1793887.6   -418.25    176.60   1322.15 -19021.575  -2085.85   1963.15      0.95  -101.075 -2336.225  4208.525    432.80 -1109.775   -12.475  -287.425 -3076.475  6409.975  -2824.75  -475.625  -5585.55    96.075    -0.42987  -1142.15   -875.95   2800.45 -1583.875  19398.00   232.675  -1684.65  -7215.60
## [11,] -304238.66 -10105196 1989502.7   -468.60    185.95   1470.95 -22246.725  -2409.40   2152.55      1.10  -112.425 -2812.525  4484.475    477.45 -1330.575   -13.875  -332.425 -3520.575  7116.375  -3173.95  -539.225  -6426.55    91.275  -105.97987  -1290.50  -1028.30   2736.95 -1775.275  21136.80   122.975  -1912.70  -8525.90
## [12,]  125501.34   7967075 2190160.0   -508.45    255.50   1640.35 -22062.325  -2406.40   2448.20      1.30  -121.375 -2721.425  5381.575    541.85 -1190.525   -15.225  -328.775 -3574.925  7926.575  -3417.30  -582.375  -6580.45   141.975   108.82013  -1346.00  -1013.40   3881.00 -1903.575  24045.10   462.525  -2054.30  -8263.30
## [13,] -181664.66    113401 2381509.7   -563.85    234.55   1757.55 -26636.825  -2848.80   2586.50      1.30  -133.975 -3413.975  5475.625    575.35 -1556.175   -16.625  -397.275 -4176.375  8512.475  -3811.30  -656.575  -7717.25   110.525  -134.62987  -1530.55  -1248.75   3396.55 -2121.825  25241.65   191.825  -2323.70 -10227.40
## [14,] -228067.79  -7704596 2574992.5   -612.80    251.70   1912.35 -29570.725  -3141.65   2792.65      1.50  -144.825 -3875.125  5791.425    622.70 -1755.825   -18.025  -435.875 -4577.725  9242.375  -4148.85  -720.525  -8513.75   108.825  -219.37987  -1667.30  -1386.20   3399.10 -2307.525  27010.80   109.175  -2539.10 -11380.01
## [15,]   41858.56  10368937 2773144.9   -657.90    283.15   2057.05 -30984.275  -3273.40   3035.45      1.65  -154.625 -4055.775  6496.875    677.25 -1779.825   -19.375  -458.825 -4808.275  9927.225  -4451.40  -778.625  -9032.20   125.425  -171.17987  -1766.05  -1480.45   4102.90 -2465.775  29287.40   267.375  -2730.35 -11885.96
## [16,] -188419.74   2597151 2963387.8   -712.45    256.85   2180.95 -35603.575  -3706.20   3178.50      1.75  -166.675 -4774.775  6605.975    712.10 -2141.775   -20.775  -527.475 -5390.725 10494.325  -4846.60  -856.275 -10178.25    89.775  -430.12987  -1947.30  -1732.90   3665.80 -2678.825  30431.70     1.025  -2999.05 -13837.11
## [17,] -323068.64  -5262336 3154217.7   -764.30    252.25   2321.10 -39421.125  -4076.90   3356.80      1.85  -177.875 -5399.775  6815.525    753.50 -2436.675   -22.175  -577.125 -5890.525 11162.225  -5214.60  -929.025 -11185.10    70.825  -606.32987  -2108.90  -1933.25   3399.35 -2883.875  31864.25  -195.425  -3241.50 -15412.21
## [18,]  169603.21  12809139 3349775.4   -804.85    304.30   2488.60 -39713.625  -4095.35   3650.85      2.10  -186.025 -5441.775  7671.925    817.50 -2345.225   -23.525  -577.375 -5964.125 11922.225  -5469.60  -980.675 -11464.50   107.725  -452.92987  -2171.05  -1969.60   4442.70 -3019.325  34415.05    79.625  -3388.90 -15287.16
## [19,]  -14822.05   4992252 3539599.9   -859.50    270.60   2611.60 -43550.375  -4537.15   3793.35      2.20  -197.825 -6203.475  7742.025    852.45 -2724.425   -24.925  -645.075 -6551.575 12461.575  -5869.35 -1061.825 -12612.25    66.925  -728.82987  -2359.65  -2241.95   3952.10 -3237.175  35165.30  -221.075  -3659.90 -17091.06
## [20,] -249474.85  -2860290 3726017.6   -915.25    231.30   2730.40 -48485.975  -4998.05   3931.65      2.25  -209.775 -7012.225  7805.375    884.80 -3129.375   -26.325  -715.925 -7164.725 12997.225  -6270.20 -1143.325 -13845.50    23.525 -1025.42987  -2552.60  -2529.65   3403.70 -3460.325  36120.70  -554.625  -3934.55 -19140.66
## [21,] -400232.10 -10722034 3915273.4   -968.25    213.70   2868.40 -52738.775  -5401.45   4099.90      2.35  -220.975 -7718.375  7967.025    924.00 -3464.325   -27.725  -770.975 -7703.825 13632.425  -6652.00 -1220.575 -14949.90    -4.075 -1248.77987  -2725.30  -2765.00   3034.45 -3672.925  37382.10  -802.875  -4188.95 -20896.76
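
The printed iterates oscillate and grow without bound: with unscaled covariates (cost reaches tens of thousands) a constant step size of 0.05 is far too large. A sketch of a stabilized run, under the assumption that one standardizes the metric columns and averages the gradient over the observations:

Xs <- X
Xs[,c("age","cost")] <- scale(X[,c("age","cost")])   # standardize the metric columns
out <- insured2009$out                               # response vector
theta <- rep(0, ncol(Xs))
lambda <- 0.5
for(j in 1:1000){
  pi <- as.vector(1/(1+exp(-(Xs %*% theta))))
  grad <- as.vector(t(Xs) %*% (pi - out))/nrow(Xs)   # average gradient
  theta <- theta - lambda * grad
}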

Choice of a model

One should first ask oneself:

  1. Is at least one of the predictors useful in predicting the response?
  2. Do all the predictors help to explain the response, or is only a subset of them useful?
  3. How well does the model fit the data?
  4. Given a set of predictor values, what response value should we predict, and how accurate is our prediction?

We address each of these questions in turn.

The first question

We test the hypothesis \[ \begin{array}{l} H_0:\beta_1=\beta_2=\cdots=\beta_p=0 \\ \qquad \mbox{against the hypothesis} \\ H_1:\mbox{at least one }\beta_j\neq 0 \end{array} \]

This test is performed by computing the \(F\)-statistic.

About the \(F\)-statistic

Consider two nested linear models. First model: \(Y=X\cdot\beta+\varepsilon\), where \(\varepsilon\sim\mathcal{N}(0,\sigma^2I)\). This means that \(E[Y]\) lies in the space \(Space(X)\) generated by the columns of \(X\). Second model: the same as the first, but the last \(q\) \(\beta\)-coefficients are zero, where \(q\leq p\), and \(p\) is the number of variables of the first model. So the second model is \(Y=X_0\cdot\beta_0+\varepsilon_0\), where \(\varepsilon_0\sim\mathcal{N}(0,\sigma^2I)\) (the same as for the first model). Here \(X_0\) is the matrix composed of the first \(p-q\) columns of \(X\). When the second model is correct, \(E[Y]\) lies in the subspace \(Space(X_0)\) of \(Space(X)\), generated by the columns of \(X_0\).

In order to decide whether or not to keep the last \(q\) variables, one wants to test: \[ \begin{array}{l} H_0:\beta_{p-q+1}=\cdots=\beta_p=0 \\ \qquad \mbox{against the hypothesis} \\ H_1:\mbox{at least one of these }\beta_j\neq 0 \end{array} \]

What could then be an adequate test statistic? Under the \(H_0\)-hypothesis, we have that \(E[Y]\in Space(X_0)\). In that case the least squares method consists in projecting the vector \(Y\) not on \(Space(X)\) but on \(Space(X_0)\), in order to obtain \(\hat{Y}_0\).

(figure: geometrical representation of the projections)

The idea of the test is to keep \(H_0\) when \(\hat{Y}_0\) is close to the projection of \(Y\) on \(Space(X)\), which we denote by \(\hat{Y}\). Indeed, if the information brought by both models is almost the same, then we’d better keep the smaller model.

To quantify this closeness between \(\hat{Y}_0\) and \(\hat{Y}\), one has to note that, say, the Euclidean distance depends on the measurement units of the data. Such a distance therefore has to be standardized, for instance by dividing by the squared Euclidean norm of the residual vector \(\hat{\varepsilon}=Y-\hat{Y}\). But the vectors \(\hat{Y}_0-\hat{Y}\) and \(\hat{\varepsilon}\) don’t belong to spaces of the same dimension, so each term is additionally divided by its degrees of freedom. The following test statistic meets our expectations:

\[ F:=\frac{||\hat{Y}_0-\hat{Y}||^2\big/q}{||Y-\hat{Y}||^2\big/(n-p)} \]

where \(n\) is the number of observations, and \(q\) is the number of variables removed in the smaller model.

One can show that under \(H_0\) the numerator and the denominator are independent, each following a \(\chi^2\) distribution (up to scaling). The quotient thus follows a Fisher (\(F\)) distribution.

When there is no relationship between the response and the set of predictors, one would expect the \(F\)-statistic to take a value close to 1. On the other hand, if the alternative hypothesis \(H_1\) is true, we expect \(F\) to be greater than 1.

How large does the \(F\)-statistic need to be before we can reject \(H_0\) and conclude that there actually is a relationship? It turns out that the answer depends on the values of \(n\) and \(p\). When \(n\) is large, and that’s definitely the case here, an \(F\)-statistic that is just a little larger than 1 might already provide evidence against \(H_0\). In our case, judging by the p-value of the \(F\)-test, one has to reject \(H_0\) and conclude that there is a relationship.

model_full <- lm(out ~ gender + canton + age + cost, data=insured2009)
summary(model_full)
## 
## Call:
## lm(formula = out ~ gender + canton + age + cost, data = insured2009)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.45924 -0.06193 -0.04744 -0.03389  1.00238 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.428e-02  2.316e-03  32.080  < 2e-16 ***
## genderW     -5.921e-03  7.675e-04  -7.715 1.22e-14 ***
## cantonAI     1.623e-02  1.295e-02   1.254 0.209955    
## cantonAR    -3.572e-03  5.792e-03  -0.617 0.537381    
## cantonAU     2.872e-02  7.465e-03   3.848 0.000119 ***
## cantonBE     1.990e-03  2.287e-03   0.870 0.384089    
## cantonBL    -1.117e-02  3.221e-03  -3.467 0.000526 ***
## cantonBS     1.150e-02  4.558e-03   2.524 0.011595 *  
## cantonDE    -2.286e-02  1.098e-01  -0.208 0.835126    
## cantonFL     5.188e-02  2.580e-02   2.011 0.044304 *  
## cantonFR    -5.887e-03  2.828e-03  -2.081 0.037393 *  
## cantonGE     2.356e-02  2.798e-03   8.422  < 2e-16 ***
## cantonGL    -1.477e-02  9.298e-03  -1.589 0.112052    
## cantonGR    -4.665e-03  3.130e-03  -1.490 0.136096    
## cantonIT    -5.390e-02  2.196e-01  -0.245 0.806140    
## cantonJU     1.333e-02  6.361e-03   2.095 0.036139 *  
## cantonLU     4.176e-02  3.054e-03  13.672  < 2e-16 ***
## cantonNE     7.920e-03  4.100e-03   1.932 0.053376 .  
## cantonNW     3.072e-02  4.677e-03   6.568 5.09e-11 ***
## cantonOW     2.096e-01  9.571e-03  21.900  < 2e-16 ***
## cantonSG     4.770e-04  2.708e-03   0.176 0.860209    
## cantonSH    -4.001e-03  6.061e-03  -0.660 0.509171    
## cantonSO     1.688e-02  3.216e-03   5.248 1.54e-07 ***
## cantonSZ    -1.031e-02  4.936e-03  -2.090 0.036646 *  
## cantonTG     5.365e-03  3.891e-03   1.379 0.167999    
## cantonTI    -9.246e-03  2.500e-03  -3.698 0.000217 ***
## cantonUR     3.459e-02  5.781e-03   5.983 2.19e-09 ***
## cantonVD     5.211e-03  2.500e-03   2.085 0.037113 *  
## cantonVS     8.687e-03  2.978e-03   2.917 0.003538 ** 
## cantonZG     4.127e-02  4.955e-03   8.329  < 2e-16 ***
## cantonZH     1.564e-02  2.416e-03   6.473 9.62e-11 ***
## age         -6.752e-04  1.800e-05 -37.512  < 2e-16 ***
## cost         2.262e-06  6.001e-08  37.689  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2196 on 328220 degrees of freedom
## Multiple R-squared:  0.01164,    Adjusted R-squared:  0.01154 
## F-statistic: 120.8 on 32 and 328220 DF,  p-value: < 2.2e-16
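
Note that lm() fits a linear probability model here, which suffices for the \(F\)-test discussion; the logistic counterpart of the full model would be fitted with glm (a sketch, output omitted):

model_logit <- glm(out ~ gender + canton + age + cost, family = binomial, data = insured2009)
summary(model_logit)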

The second question

How to define a subset of the predictors that is sufficient to predict the response?

The above developments bring a first answer to that question: use the \(F\)-statistic corresponding to the above defined test, where not all coefficients \(\beta\) are set equal to zero. One may have to reorder the variables so that the \(p-q\) kept variables come first and the \(q\) dropped variables, whose coefficients are set equal to zero, come last. An example is sketched below.
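
In R, such a partial \(F\)-test is obtained by comparing the nested models with anova(); a sketch that drops, say, the canton variable:

model_small <- lm(out ~ gender + age + cost, data = insured2009)
anova(model_small, model_full)   # F-test of the canton coefficients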

Notice that in the summary table (above), a \(t\)-statistic and a p-value are reported for each individual predictor. These provide information about whether each individual predictor is related to the response, after adjusting for the other predictors. It turns out that each of these is exactly equivalent to the \(F\)-test that omits that single variable from the model, leaving all the others in; equivalent in the sense that the square of each \(t\)-statistic is the corresponding \(F\)-statistic. These \(t\)-tests report the partial effect of adding that variable to the model.

So we want to determine which variables are related to the response. We could look at each individual p-value of the corresponding \(t\)-test, but if \(p\) is large, we are likely to make some false discoveries: the expected number of p-values falling below a threshold of, say, 5% is more than one in our case (we have 32 predictors: \(32\cdot 5\% >1\)), so we expect at least one false discovery even under the global null.

To solve this difficulty, one may use the BIC criterion for variable selection. In our case however, there is no real necessity for that.
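
A sketch of such a BIC-based selection with step(), where the penalty k = log(n) turns the default AIC criterion into BIC:

n <- nrow(insured2009)
model_bic <- step(model_full, k = log(n), trace = 0)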

The third question

Two of the most common measures of model fit are the RSE and the \(R^2\).

Recall that in simple regression, \(R^2\) is the square of the correlation between the response and the variable. In multiple linear regression, it turns out to equal \(Cor(Y,\hat{Y})^2\); geometrically, \(R^2\) is the squared cosine of the angle between the centered response and its projection \(\hat{Y}\). A fitted model maximizes this correlation among all possible linear models. An \(R^2\) close to 1 indicates that the model explains a large portion of the variance in the response. But when more variables are added to the model, the \(R^2\) always increases. That makes the \(R^2\) an inappropriate measure for comparing models with different numbers of variables.
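
This identity is easy to check on the fitted full model; the value matches the Multiple R-squared reported in the summary above:

cor(insured2009$out, fitted(model_full))^2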

The fourth question

Predictions: consider the interval for the prediction of a new observation.

Two-sided interval of level \(1-\alpha\) for a new \(y_{n+1}\): \[ \left[ x_{n+1}^T\cdot\hat{\beta}\,\pm\, t_{n-p}(1-\alpha/2)\cdot \hat{\sigma}\cdot \sqrt{x_{n+1}^T\cdot(X^TX)^{-1}\cdot x_{n+1}+1}\right]. \]
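
In R, predict() computes this interval; a sketch for a hypothetical new insured person (the values are made up for illustration):

newpers <- data.frame(gender = "W", canton = "BE", age = 40, cost = 1500)
predict(model_full, newdata = newpers, interval = "prediction", level = 0.95)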